Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences
نویسندگان
چکیده
High-throughput sequencing of cDNA (RNA-seq) is used extensively to characterize the transcriptome of cells. Many transcriptomic studies aim at comparing either abundance levels or the transcriptome composition between given conditions, and as a first step, the sequencing reads must be used as the basis for abundance quantification of transcriptomic features of interest, such as genes or transcripts. Several different quantification approaches have been proposed, ranging from simple counting of reads that overlap given genomic regions to more complex estimation of underlying transcript abundances. In this paper, we show that gene-level abundance estimates and statistical inference offer advantages over transcript-level analyses, in terms of performance and interpretability. We also illustrate that while the presence of differential isoform usage can lead to inflated false discovery rates in differential expression analyses on simple count matrices and transcript-level abundance estimates improve the performance in simulated data, the difference is relatively minor in several real data sets. Finally, we provide an R package ( tximport) to help users integrate transcript-level abundance estimates from common quantification pipelines into count-based statistical inference engines.
منابع مشابه
Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences [version 2; referees: 2 approved]
High-throughput sequencing of cDNA (RNA-seq) is used extensively to characterize the transcriptome of cells. Many transcriptomic studies aim at comparing either abundance levels or the transcriptome composition between given conditions, and as a first step, the sequencing reads must be used as the basis for abundance quantification of transcriptomic features of interest, such as genes or transc...
متن کاملDifferential analyses for RNA-seq: transcript-level estimates improve gene-level inferences Supplementary Material
The sim2 data set consists of simulated sequencing reads from the human chromosome 1. The sequencing parameters as well as underlying TPM values for the 15,677 transcripts in one of the two simulated conditions were estimated using RSEM v1.2.21 [6] from the ERS326990 sample from the ArrayExpress data set with accession number E MTAB 1733. We simulated three biological replicates from each of tw...
متن کاملGene-level differential analysis at transcript-level resolution
Gene-level differential expression analysis based on RNA-Seq is more robust, powerful and biologically actionable than transcript-level differential analysis. However aggregation of transcript counts prior to analysis results can mask transcript-level dynamics. We demonstrate that aggregating the results of transcript-level analysis allow for gene-level analysis with transcript-level resolution...
متن کاملFlexible expressed region analysis for RNA-seq with derfinder
Differential expression analysis of RNA sequencing (RNA-seq) data typically relies on reconstructing transcripts or counting reads that overlap known gene structures. We previously introduced an intermediate statistical approach called differentially expressed region (DER) finder that seeks to identify contiguous regions of the genome showing differential expression signal at single base resolu...
متن کاملLength bias correction for RNA-seq data in gene set analyses
MOTIVATION Next-generation sequencing technologies are being rapidly applied to quantifying transcripts (RNA-seq). However, due to the unique properties of the RNA-seq data, the differential expression of longer transcripts is more likely to be identified than that of shorter transcripts with the same effect size. This bias complicates the downstream gene set analysis (GSA) because the methods ...
متن کامل